Homework_4_Disease Prediction

Yueyuan He

04/21/2020

Summary

  • I can get pretty results from this dataset after I processed the data, the best result is close to 74%
  • Although the performance of these models are close, nerual network and decision tree would be faster than logistic regression a lot
  • By features engineering, high blood pressure is the major factor which affects a person gets disease whether or not
  • If to add more hidden layers, neural network could get a better performance

Introduction

Background

This dataset is to classify whether or not a patient has a certain unspecified disease according to his/her part of information. I used decision tree, logistic regression, multiple artificial neural networks and deep learning models to solve this problem. After validating model performance and tuning model.Finally, got the prediction for each model.

Dataset

Attributes’ information about the dataset (Disease Prediction Training.csv):

  • Age: in years
  • Gender: male/female
  • Height: in unit of cm
  • Weight: in unit of kg
  • Low Blood Pressure: lower bound of blood pressure measurement
  • High Blood Pressure: higher bound of blood pressure measurement
  • Cholesterol: three cholesteral levels
  • Glucose: three glucose levels
  • Smoke: 1/0 regarding if the patient smokes
  • Alcohol: 1/0 regarding if the patient drinks alcohol
  • Exercise: 1/0 regarding if the patient exercises regularly
  • Disease: The binary target variable. Does the patient have the disease?

Data Exploration

Package importing

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import GridSearchCV, KFold,train_test_split,cross_val_score, cross_validate, ShuffleSplit, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
np.random.seed(33)

Data Import

In [2]:
df = pd.read_csv("./Disease Prediction Training.csv")
In [3]:
df.head()
Out[3]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
0 59 female 167 88.0 130 68 normal normal 0 0 1 0
1 64 female 150 71.0 140 100 normal normal 0 0 0 1
2 41 female 166 83.0 100 70 normal normal 0 1 1 0
3 50 male 172 110.0 130 80 normal normal 1 0 1 0
4 39 female 162 61.0 110 80 high high 0 0 1 0

Explore Data Analysis

Data Size and Types

In [4]:
df.shape
Out[4]:
(49000, 12)
In [5]:
df.dtypes
Out[5]:
Age                      int64
Gender                  object
Height                   int64
Weight                 float64
High Blood Pressure      int64
Low Blood Pressure       int64
Cholesterol             object
Glucose                 object
Smoke                    int64
Alcohol                  int64
Exercise                 int64
Disease                  int64
dtype: object

Data Missing Value

In [6]:
df.isnull().sum()
Out[6]:
Age                    0
Gender                 0
Height                 0
Weight                 0
High Blood Pressure    0
Low Blood Pressure     0
Cholesterol            0
Glucose                0
Smoke                  0
Alcohol                0
Exercise               0
Disease                0
dtype: int64

Data Value Analysis

In [7]:
df.describe()
Out[7]:
Age Height Weight High Blood Pressure Low Blood Pressure Smoke Alcohol Exercise Disease
count 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000
mean 52.853306 164.366878 74.190527 128.698939 96.917367 0.088265 0.054245 0.803204 0.499959
std 6.763065 8.216637 14.329934 147.624582 200.368069 0.283683 0.226503 0.397581 0.500005
min 29.000000 55.000000 10.000000 -150.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 48.000000 159.000000 65.000000 120.000000 80.000000 0.000000 0.000000 1.000000 0.000000
50% 53.000000 165.000000 72.000000 120.000000 80.000000 0.000000 0.000000 1.000000 0.000000
75% 58.000000 170.000000 82.000000 140.000000 90.000000 0.000000 0.000000 1.000000 1.000000
max 64.000000 207.000000 200.000000 14020.000000 11000.000000 1.000000 1.000000 1.000000 1.000000
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000 entries, 0 to 48999
Data columns (total 12 columns):
Age                    49000 non-null int64
Gender                 49000 non-null object
Height                 49000 non-null int64
Weight                 49000 non-null float64
High Blood Pressure    49000 non-null int64
Low Blood Pressure     49000 non-null int64
Cholesterol            49000 non-null object
Glucose                49000 non-null object
Smoke                  49000 non-null int64
Alcohol                49000 non-null int64
Exercise               49000 non-null int64
Disease                49000 non-null int64
dtypes: float64(1), int64(8), object(3)
memory usage: 4.5+ MB
In [9]:
df.head()
Out[9]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
0 59 female 167 88.0 130 68 normal normal 0 0 1 0
1 64 female 150 71.0 140 100 normal normal 0 0 0 1
2 41 female 166 83.0 100 70 normal normal 0 1 1 0
3 50 male 172 110.0 130 80 normal normal 1 0 1 0
4 39 female 162 61.0 110 80 high high 0 0 1 0

Data Outlier Detection

In [10]:
ax = sns.distplot(df["Age"])
In [11]:
fig = px.violin(df, y="High Blood Pressure", x="Glucose", color="Gender", box=True, points="all",
          hover_data=df.columns)
fig.show()
In [12]:
fig = px.violin(df, y="Low Blood Pressure", x="Glucose", color="Gender", box=True, points="all",
          hover_data=df.columns)
fig.show()
In [13]:
fig = px.violin(df, y="Height", x="Cholesterol", color="Gender", box=True, points="all",
          hover_data=df.columns)
fig.show()
In [14]:
fig = px.violin(df, y="Age", x="Glucose", color="Gender", box=True, points="all",
          hover_data=df.columns)
fig.show()

Data Preprocessing

Outlier Processing

In [16]:
def blood_pressure(x):
    if x < 0:
        x = abs(x)
    elif x > 0 and x <30:
        x = x * 10
    elif x> 300 and x <= 2000:
        x = int(x/10)
    elif x > 2000:
        x = int(x/100)
    else:
        x = x

    return x 
In [17]:
df["Low Blood Pressure"] = df["Low Blood Pressure"].apply(blood_pressure)
df["High Blood Pressure"] = df["High Blood Pressure"].apply(blood_pressure)
In [18]:
df = df[df['Low Blood Pressure'] > 0]
df["Low Blood Pressure"]= df.apply(lambda x: 90 if x["Low Blood Pressure"]>90 else x, axis=1)
In [19]:
def CheckHigh(x):
    if x <120:
        x = 120
        return x
    else:
        return x
In [20]:
df["High Blood Pressure"]= df["High Blood Pressure"].apply(CheckHigh)
In [21]:
def switchA(a,b):
    c = 0
    if a < b:
        c = b
        a = c
        return a
    else:
        return a
In [22]:
df["High Blood Pressure"] = df.apply(lambda x: switchA(x["High Blood Pressure"],x["Low Blood Pressure"]),axis=1)
In [23]:
ax = sns.distplot(df["High Blood Pressure"])
In [24]:
ax = sns.distplot(df["Low Blood Pressure"])

Catorical to Numerical

In [25]:
catorical_list = ['Gender','Cholesterol','Glucose']
In [26]:
def cat2num(df):
    for i in catorical_list:
        df = df.join(pd.get_dummies(df[i],prefix=i+'_'))
        df = df.drop([i],axis=1)
    return df
In [27]:
df =cat2num(df)

Splitting Data

In [28]:
X = df.drop(['Disease'],axis = 1)
y = df['Disease']
In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=33)
In [30]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

scaler.fit(X)
X_std = scaler.transform(X)

Conclusion

I found this dataset doesn't have any missing value, but its bloodpressure has a lot of problems by violin plots, some of them are over reasonable value, so I used apply function in pandas to make them more sense, and set the highest value of LowBloodPressure is 90, the lowest value of HighBloodPressure is 120 to make sure bloodpressure is in a reasonable and scientific range. Then making catorical data into numerical by one hot encoding, and splitted this dataset into training data and testing data.

Model

Model Performance Function

In [31]:
def model_performance(y_test, y_pred):
    print(classification_report(y_test, y_pred, target_names=["Disease","No Disease"]))
In [32]:
def roc_chart(model,label):    
    y_predict_proba = model.predict_proba(X_test_std)
    fpr, tpr, threshold = roc_curve(y_test, pd.DataFrame(y_predict_proba)[1])
    plt.plot([0,1],[0,1],'r--',label='Ramdom Guess')
    plt.plot(fpr, tpr, linestyle='solid', linewidth=2, label=label)
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.legend()
    plt.title(label+' ROC')
    plt.show()
    print('AUC is:',auc(fpr, tpr))
In [99]:
def ann_plot(model):
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))
    ax1.plot(model.history['loss'])
    ax1.set_title('model loss',loc='center')
    ax1.set_ylabel('loss ')
    ax1.set_xlabel('epoch')
    ax1.legend(['train', 'test'], loc='upper left')
    
    ax2.plot(model.history['binary_accuracy'])
    ax2.set_title('binary_accuracy',loc='center')
    ax2.set_ylabel('binary_accuracy')
    ax2.set_xlabel('epoch')
    ax2.legend(['train', 'test'], loc='upper left')
#    plt.tight_layout()
    plt.show()

Logistics Regression

Baseline Model

In [34]:
logr_pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=33))
logr_pipe.fit(X_train_std, y_train)
Out[34]:
Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=33,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
In [35]:
y_pred = logr_pipe.predict(X_test_std)
In [36]:
model_performance(y_test, y_pred)
              precision    recall  f1-score   support

     Disease       0.70      0.81      0.75      7329
  No Disease       0.78      0.65      0.70      7368

    accuracy                           0.73     14697
   macro avg       0.74      0.73      0.73     14697
weighted avg       0.74      0.73      0.73     14697

Model Tuning

In [37]:
lr_time = time.time()
cv = KFold(n_splits=10, shuffle=True, random_state=33)
from sklearn.model_selection import GridSearchCV
lr_param_grid = {'penalty': ['l1','l2'],
              'solver': ['liblinear','saga'],
                 'C': np.linspace(0.01,1.01,9)}
grid = GridSearchCV(LogisticRegression(random_state=33)
                        , lr_param_grid,cv=cv)
lr_grid = grid.fit(X_train_std,y_train)
y_pred = lr_grid.predict(X_test_std)
best_lr = lr_grid.best_estimator_
lr_cost = time.time()-lr_time
print(lr_grid.best_score_)
print(lr_grid.best_params_)
print(f'the best model of Logistic regression needs {time.strftime("%H:%M:%S", time.gmtime(lr_cost))} to run.')
0.7282376947779486
{'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}
the best model of Logistic regression needs 00:02:00 to run.
In [38]:
roc_chart(lr_grid.best_estimator_,'Logistic Regression Grid')
AUC is: 0.7970669927995652
In [39]:
model_performance(y_test, y_pred)
              precision    recall  f1-score   support

     Disease       0.70      0.81      0.75      7329
  No Disease       0.78      0.65      0.71      7368

    accuracy                           0.73     14697
   macro avg       0.74      0.73      0.73     14697
weighted avg       0.74      0.73      0.73     14697

In [40]:
pd.DataFrame({'Features':X.columns,'Coefficients':best_lr.coef_.ravel()}).sort_values("Coefficients", ascending=False)
Out[40]:
Features Coefficients
3 High Blood Pressure 1.037595
0 Age 0.380110
12 Cholesterol__too high 0.241950
2 Weight 0.150784
14 Glucose__normal 0.035756
13 Glucose__high 0.022216
4 Low Blood Pressure 0.008901
9 Gender__male 0.000795
1 Height -0.000051
8 Gender__female -0.000795
10 Cholesterol__high -0.015573
5 Smoke -0.034817
6 Alcohol -0.044351
15 Glucose__too high -0.070009
7 Exercise -0.081776
11 Cholesterol__normal -0.164857

Conclusion

By tuning logistic regression hyper-parameters, such as solver, penalty and C,I got a best logistic regression model for this dataset. The ROC and AUC shows a good result, in addition, High Blood Pressure, Age, Cholesterol__too high are the top 3 factors which affect if a person got disease or not.

Multiple Artificial Neural Networks

Summary

I set the loss function is binary_crossentropy and metrcs is binary_accuracy because this problem is a binary classfication, so using binary_crossentropy and binary_accuracy can help get a good performance. I didn't do a regularization because all of the three model did not overfit according to evaluate training data.

ANN0

In [41]:
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
#from tensorflow.keras.regularizers import l2

ANN0_model = Sequential()
ANN0_model.add(Dense(1, activation='sigmoid',input_shape=(len(X_train.columns),)))
ANN0_model.output_shape
/Users/mark/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

/Users/mark/anaconda3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

Out[41]:
(None, 1)
In [42]:
np.__version__
Out[42]:
'1.17.4'

The reason why I got a lot of warnings is because the version of numpy is lastest, it would disappear if the version of numpy is under 1.16. reference is here: https://github.com/tensorflow/tensorflow/issues/30427

In [43]:
ANN0_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 17
Trainable params: 17
Non-trainable params: 0
_________________________________________________________________
In [95]:
ANN0_model.compile(loss='binary_crossentropy'
                   , optimizer='sgd'
                   , metrics=['binary_accuracy'
                   ])
ANN0_model.fit(X_train_std, y_train, epochs=10, batch_size=128, verbose=1)
Train on 34291 samples
Epoch 1/10
34291/34291 [==============================] - 0s 7us/sample - loss: 0.5581 - binary_accuracy: 0.7288
Epoch 2/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7282
Epoch 3/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7282
Epoch 4/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7284
Epoch 5/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7281
Epoch 6/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7282
Epoch 7/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7285
Epoch 8/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7281
Epoch 9/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7280
Epoch 10/10
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7283
Out[95]:
<tensorflow.python.keras.callbacks.History at 0x7fac5b630dd0>
In [101]:
start = time.time()
ANN0_model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['binary_accuracy'])
ANN0 = ANN0_model.fit(X_train_std, y_train, epochs=15, batch_size=128, verbose=1)
ann0_cost = time.time() - start
ann0_params= {'epochs':20,'batch_size':128,
              'loss':'binary_crossentropy',
              'optimizer':'RMSprop','metrics':'binary_accuracy'}
print(f'the best model of ANN0 needs {time.strftime("%H:%M:%S", time.gmtime(ann0_cost))} to run.')
Train on 34291 samples
Epoch 1/15
34291/34291 [==============================] - 0s 8us/sample - loss: 0.5583 - binary_accuracy: 0.7288
Epoch 2/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7286
Epoch 3/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7284
Epoch 4/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5582 - binary_accuracy: 0.7284
Epoch 5/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7284
Epoch 6/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5582 - binary_accuracy: 0.7279
Epoch 7/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7278
Epoch 8/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5582 - binary_accuracy: 0.7286
Epoch 9/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5582 - binary_accuracy: 0.7282
Epoch 10/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7282
Epoch 11/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7282
Epoch 12/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7282
Epoch 13/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5582 - binary_accuracy: 0.7284
Epoch 14/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5582 - binary_accuracy: 0.7284
Epoch 15/15
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5583 - binary_accuracy: 0.7284
the best model of ANN0 needs 00:00:03 to run.
In [102]:
ann_plot(ANN0)
In [47]:
checkover = ANN0_model.evaluate(X_train_std, y_train, batch_size=128,verbose=1)
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5581 - binary_accuracy: 0.7289
In [48]:
ann0_score = ANN0_model.evaluate(X_test_std, y_test, batch_size=128)
14697/14697 [==============================] - 0s 5us/sample - loss: 0.5630 - binary_accuracy: 0.7295
In [49]:
y_pred_keras = ANN0_model.predict_proba(X_train_std).ravel()
fpr_keras, tpr_keras, thresholds_keras = roc_curve(y_train, y_pred_keras)
ANN0_score = auc(fpr_keras, tpr_keras)
print(ANN0_score)
plt.plot(fpr_keras, tpr_keras, label='Keras (area = {:.3f})'.format(ANN0_score))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC curve')
plt.legend(loc='best')
0.7964990015846224
Out[49]:
<matplotlib.legend.Legend at 0x7fac3943b790>

Conclusion

I changed input nodes to be more larger and smaller, I did not see anything improvement. And I tried to change optimization function, but I didn't found there is anything improvement and so epoch and batchsize too. By using evaluation function, the binary accuracy of this model is 0.7295. I used AUC to check this result, it is 0.7965.

ANN1

In [197]:
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

ANN1_model = Sequential()
ANN1_model.add(Dense(16, activation='relu', input_shape=(len(X_train.columns), )))
ANN1_model.add(Dense(1, activation='sigmoid'))
ANN1_model.output_shape
Out[197]:
(None, 1)
In [198]:
ANN1_model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_9 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 17        
=================================================================
Total params: 289
Trainable params: 289
Non-trainable params: 0
_________________________________________________________________
In [199]:
ANN1_model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['binary_accuracy'])
ANN1_model.fit(X_train_std, y_train, epochs=20, batch_size=128, verbose=0)
Out[199]:
<tensorflow.python.keras.callbacks.History at 0x7fac4bd78e10>
In [200]:
ANN1_model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['binary_accuracy'])
ANN1 = ANN1_model.fit(X_train_std, y_train, epochs=25, batch_size=128, verbose=1)
Train on 34291 samples
Epoch 1/25
34291/34291 [==============================] - 0s 9us/sample - loss: 0.5543 - binary_accuracy: 0.7252
Epoch 2/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5512 - binary_accuracy: 0.7278
Epoch 3/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5489 - binary_accuracy: 0.7293
Epoch 4/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5474 - binary_accuracy: 0.7312
Epoch 5/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5461 - binary_accuracy: 0.7324
Epoch 6/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5452 - binary_accuracy: 0.7330
Epoch 7/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5445 - binary_accuracy: 0.7339
Epoch 8/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5440 - binary_accuracy: 0.7337
Epoch 9/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5434 - binary_accuracy: 0.7342
Epoch 10/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5429 - binary_accuracy: 0.7348
Epoch 11/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5427 - binary_accuracy: 0.7354
Epoch 12/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5424 - binary_accuracy: 0.7349
Epoch 13/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5420 - binary_accuracy: 0.7343
Epoch 14/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5416 - binary_accuracy: 0.7344
Epoch 15/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5415 - binary_accuracy: 0.7349
Epoch 16/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5413 - binary_accuracy: 0.7352
Epoch 17/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5411 - binary_accuracy: 0.7351
Epoch 18/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5408 - binary_accuracy: 0.7357
Epoch 19/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5407 - binary_accuracy: 0.7350
Epoch 20/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5406 - binary_accuracy: 0.7357
Epoch 21/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5405 - binary_accuracy: 0.7353
Epoch 22/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5404 - binary_accuracy: 0.7360
Epoch 23/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5402 - binary_accuracy: 0.7347
Epoch 24/25
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5400 - binary_accuracy: 0.7354
Epoch 25/25
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5400 - binary_accuracy: 0.7351
In [213]:
ann_plot(ANN1)
In [217]:
start = time.time()
ANN1_model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['binary_accuracy'])
ANN1 = ANN1_model.fit(X_train_std, y_train, epochs=30, batch_size=128, verbose=1)
ann1_cost = time.time() - start
print(f'the best model of ANN1 needs {time.strftime("%H:%M:%S", time.gmtime(ann1_cost))} to run.')
ann1_params= {'epochs':30,'batch_size':128,
              'loss':'binary_crossentropy',
              'optimizer':'RMSprop','metrics':'binary_accuracy'}
Train on 34291 samples
Epoch 1/30
34291/34291 [==============================] - 0s 10us/sample - loss: 0.5371 - binary_accuracy: 0.7373
Epoch 2/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5371 - binary_accuracy: 0.7375
Epoch 3/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5371 - binary_accuracy: 0.7360
Epoch 4/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5370 - binary_accuracy: 0.7373
Epoch 5/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5369 - binary_accuracy: 0.7363
Epoch 6/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5372 - binary_accuracy: 0.7369
Epoch 7/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5371 - binary_accuracy: 0.7362
Epoch 8/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7373
Epoch 9/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5371 - binary_accuracy: 0.7365
Epoch 10/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7366
Epoch 11/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5369 - binary_accuracy: 0.7364
Epoch 12/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7372
Epoch 13/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5369 - binary_accuracy: 0.7357
Epoch 14/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7367
Epoch 15/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7370
Epoch 16/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5370 - binary_accuracy: 0.7366
Epoch 17/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5369 - binary_accuracy: 0.7365
Epoch 18/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5369 - binary_accuracy: 0.7367
Epoch 19/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5370 - binary_accuracy: 0.7368
Epoch 20/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5370 - binary_accuracy: 0.7364
Epoch 21/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7361
Epoch 22/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5370 - binary_accuracy: 0.7368
Epoch 23/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5369 - binary_accuracy: 0.7365
Epoch 24/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5368 - binary_accuracy: 0.7367
Epoch 25/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5370 - binary_accuracy: 0.7368
Epoch 26/30
34291/34291 [==============================] - 0s 5us/sample - loss: 0.5368 - binary_accuracy: 0.7355
Epoch 27/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5368 - binary_accuracy: 0.7356
Epoch 28/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5369 - binary_accuracy: 0.7370
Epoch 29/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5369 - binary_accuracy: 0.7365
Epoch 30/30
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5369 - binary_accuracy: 0.7364
the best model of ANN1 needs 00:00:06 to run.
In [215]:
checkover = ANN1_model.evaluate(X_train_std, y_train, batch_size=128)
34291/34291 [==============================] - 0s 4us/sample - loss: 0.5364 - binary_accuracy: 0.7363
In [216]:
ann1_score = ANN1_model.evaluate(X_test_std, y_test, batch_size=128)
14697/14697 [==============================] - 0s 5us/sample - loss: 0.5437 - binary_accuracy: 0.7375
In [218]:
y_pred_keras = ANN1_model.predict_proba(X_train_std).ravel()
fpr_keras, tpr_keras, thresholds_keras = roc_curve(y_train, y_pred_keras)
ANN1_score = auc(fpr_keras, tpr_keras)
print(ANN1_score)
plt.plot(fpr_keras, tpr_keras, label='Keras (area = {:.3f})'.format(ANN1_score))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC curve')
plt.legend(loc='best')
0.8060053479729762
Out[218]:
<matplotlib.legend.Legend at 0x7fac4c1dab50>

Conclusion

When I added one hidden layer, I changed the optimization into RMSprop Adadelta and adam, and I found there is a little improvement in ANN1 Model. When I changed the value of epoch, I found it would be overfitted if the value is more than 25, so I set 25 epoches for this model. In addition, the input node is 32 and the only one hidden layer is 32 can help me get a better performance according to tune. At last, the accuracy is 0.7375, and its AUC can reach 0.8060.

ANN2

In [225]:
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

ANN2_model = Sequential()
ANN2_model.add(Dense(16, activation='relu', input_shape=(len(X_train.columns), )))
ANN2_model.add(Dense(8, activation='relu'))
ANN2_model.add(Dense(1, activation='sigmoid'))
ANN2_model.output_shape
Out[225]:
(None, 1)
In [226]:
ANN2_model.summary()
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_14 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_15 (Dense)             (None, 8)                 136       
_________________________________________________________________
dense_16 (Dense)             (None, 1)                 9         
=================================================================
Total params: 417
Trainable params: 417
Non-trainable params: 0
_________________________________________________________________
In [227]:
start = time.time()
ANN2_model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['binary_accuracy'])
ANN2 = ANN2_model.fit(X_train_std, y_train, epochs=50, batch_size=128, verbose=1)
ann2_cost = time.time() - start
print(f'the best model of ANN2 needs {time.strftime("%H:%M:%S", time.gmtime(ann2_cost))} to run.')
ann2_params= {'epochs':30,'batch_size':128,
              'loss':'binary_crossentropy',
              'optimizer':'adam','metrics':'binary_accuracy'}
Train on 34291 samples
Epoch 1/50
34291/34291 [==============================] - 0s 11us/sample - loss: 0.6159 - binary_accuracy: 0.6564
Epoch 2/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5661 - binary_accuracy: 0.7179
Epoch 3/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5522 - binary_accuracy: 0.7302
Epoch 4/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5479 - binary_accuracy: 0.7316
Epoch 5/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5462 - binary_accuracy: 0.7326
Epoch 6/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5451 - binary_accuracy: 0.7319
Epoch 7/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5444 - binary_accuracy: 0.7326
Epoch 8/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5439 - binary_accuracy: 0.7333
Epoch 9/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5433 - binary_accuracy: 0.7332
Epoch 10/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5429 - binary_accuracy: 0.7337
Epoch 11/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5427 - binary_accuracy: 0.7330
Epoch 12/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5422 - binary_accuracy: 0.7337
Epoch 13/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5420 - binary_accuracy: 0.7345
Epoch 14/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5416 - binary_accuracy: 0.7347
Epoch 15/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5414 - binary_accuracy: 0.7346
Epoch 16/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5412 - binary_accuracy: 0.7345
Epoch 17/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5409 - binary_accuracy: 0.7334
Epoch 18/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5408 - binary_accuracy: 0.7350
Epoch 19/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5405 - binary_accuracy: 0.7354
Epoch 20/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5404 - binary_accuracy: 0.7341
Epoch 21/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5402 - binary_accuracy: 0.7354
Epoch 22/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5401 - binary_accuracy: 0.7350
Epoch 23/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5400 - binary_accuracy: 0.7349
Epoch 24/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5397 - binary_accuracy: 0.7357
Epoch 25/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5395 - binary_accuracy: 0.7354
Epoch 26/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5395 - binary_accuracy: 0.7352
Epoch 27/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5392 - binary_accuracy: 0.7356
Epoch 28/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5392 - binary_accuracy: 0.7355
Epoch 29/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5391 - binary_accuracy: 0.7349
Epoch 30/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5389 - binary_accuracy: 0.7357
Epoch 31/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5388 - binary_accuracy: 0.7357
Epoch 32/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5388 - binary_accuracy: 0.7347
Epoch 33/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5387 - binary_accuracy: 0.7351
Epoch 34/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5387 - binary_accuracy: 0.7348
Epoch 35/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5385 - binary_accuracy: 0.7353
Epoch 36/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5382 - binary_accuracy: 0.7358
Epoch 37/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5383 - binary_accuracy: 0.7351
Epoch 38/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5383 - binary_accuracy: 0.7349
Epoch 39/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5382 - binary_accuracy: 0.7360
Epoch 40/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5380 - binary_accuracy: 0.7353
Epoch 41/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5382 - binary_accuracy: 0.7349
Epoch 42/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5380 - binary_accuracy: 0.7364
Epoch 43/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5379 - binary_accuracy: 0.7345
Epoch 44/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5377 - binary_accuracy: 0.7354
Epoch 45/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5377 - binary_accuracy: 0.7353
Epoch 46/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5377 - binary_accuracy: 0.7354
Epoch 47/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5376 - binary_accuracy: 0.7356
Epoch 48/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5376 - binary_accuracy: 0.7361
Epoch 49/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5374 - binary_accuracy: 0.7367
Epoch 50/50
34291/34291 [==============================] - 0s 6us/sample - loss: 0.5375 - binary_accuracy: 0.7360
the best model of ANN2 needs 00:00:10 to run.
In [228]:
ann_plot(ANN2)
In [229]:
checkover = ANN2_model.evaluate(X_train_std, y_train, batch_size=128)
34291/34291 [==============================] - 0s 4us/sample - loss: 0.5364 - binary_accuracy: 0.7372
In [230]:
ann2_score = ANN2_model.evaluate(X_test_std, y_test, batch_size=128)
14697/14697 [==============================] - 0s 5us/sample - loss: 0.5441 - binary_accuracy: 0.7380
In [231]:
y_pred_keras = ANN2_model.predict_proba(X_train_std).ravel()
fpr_keras, tpr_keras, thresholds_keras = roc_curve(y_train, y_pred_keras)
ANN2_score = auc(fpr_keras, tpr_keras)
print(ANN2_score)
plt.plot(fpr_keras, tpr_keras, label='Keras (area = {:.3f})'.format(ANN2_score))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC curve')
plt.legend(loc='best')
0.8058030299152328
Out[231]:
<matplotlib.legend.Legend at 0x7fac1a183b10>

Conclusion

In this model architecture, when I added two hidden layer, I found the first hidden is 16, the second layer is 8, epoch is 30, this model can get a better performance, and in the meanwhile, this model won't be overfitted. Its accuracy is 0.7380, and its AUC can reach 0.8058.

Decision Tree

Basline

In [67]:
dt_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier())
dt_pipe.fit(X_train_std, y_train)
Out[67]:
Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort='deprecated', random_state=None,
                                        splitter='best'))],
         verbose=False)
In [68]:
y_pred = dt_pipe.predict(X_test_std)
In [69]:
model_performance(y_test, y_pred)
              precision    recall  f1-score   support

     Disease       0.63      0.65      0.64      7329
  No Disease       0.64      0.62      0.63      7368

    accuracy                           0.63     14697
   macro avg       0.63      0.63      0.63     14697
weighted avg       0.63      0.63      0.63     14697

In [70]:
dt_start = time.time()
param = {'criterion':['gini','entropy']
         ,'max_depth':range(2,8,1)
         ,'min_samples_leaf':range(1,10,2)
         ,'min_impurity_decrease':np.arange(0.01,0.5,0.05)
        }
kfold = KFold(n_splits=10)
grid = GridSearchCV(DecisionTreeClassifier(), param_grid=param, cv=kfold)
dt_grid = grid.fit(X_train_std,y_train)
print('best_param:',dt_grid.best_params_,'best_score:', dt_grid.best_score_) 
y_pred = dt_grid.predict(X_test_std)
best_dt = dt_grid.best_estimator_
dt_cost = time.time()-dt_start
print(f'the best model of Decision Tree needs {time.strftime("%H:%M:%S", time.gmtime(dt_cost))} to run.')
best_param: {'criterion': 'gini', 'max_depth': 2, 'min_impurity_decrease': 0.01, 'min_samples_leaf': 1} best_score: 0.71671848842024
the best model of Decision Tree needs 00:01:18 to run.
In [71]:
roc_chart(dt_grid,'Logistic Regression')
AUC is: 0.7693329557042072
In [72]:
model_performance(y_test, y_pred)
              precision    recall  f1-score   support

     Disease       0.68      0.81      0.74      7329
  No Disease       0.76      0.63      0.69      7368

    accuracy                           0.72     14697
   macro avg       0.72      0.72      0.72     14697
weighted avg       0.72      0.72      0.72     14697

In [73]:
dt_cof = pd.DataFrame({'Features':X.columns,'DT_Importance':best_dt.feature_importances_}).sort_values("DT_Importance", ascending=False)
dt_cof
Out[73]:
Features DT_Importance
3 High Blood Pressure 0.897399
4 Low Blood Pressure 0.102601
0 Age 0.000000
1 Height 0.000000
2 Weight 0.000000
5 Smoke 0.000000
6 Alcohol 0.000000
7 Exercise 0.000000
8 Gender__female 0.000000
9 Gender__male 0.000000
10 Cholesterol__high 0.000000
11 Cholesterol__normal 0.000000
12 Cholesterol__too high 0.000000
13 Glucose__high 0.000000
14 Glucose__normal 0.000000
15 Glucose__too high 0.000000

Compared with RF and GBT

In [74]:
rf_param_grid = {'n_estimators': range(80, 140,20),
                 'max_depth':range(1, 10),
              'max_features': range(3, 5)}
rf_grid = GridSearchCV(RandomForestClassifier(), rf_param_grid)
rf_grid = rf_grid.fit(X_train_std, y_train)
In [75]:
roc_chart(rf_grid,'Random Forest')
AUC is: 0.7995248691520263
In [76]:
rf_fi = pd.DataFrame({'Features':X.columns,'RF_Importance':rf_grid.best_estimator_.feature_importances_}).sort_values("RF_Importance", ascending=False)
In [77]:
param_grid = {'learning_rate': np.arange(0.02, 0.1, 0.01),
              'n_estimators': range(80, 100,20),
              'max_features':range(3, 8),
              'max_depth': range(2, 5)}
gbt_grid = GridSearchCV(GradientBoostingClassifier(), param_grid)
gbt_grid.fit(X_train_std, y_train)
Out[77]:
GridSearchCV(cv=None, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_n...
                                                  random_state=None,
                                                  subsample=1.0, tol=0.0001,
                                                  validation_fraction=0.1,
                                                  verbose=0, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'learning_rate': array([0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]),
                         'max_depth': range(2, 5), 'max_features': range(3, 8),
                         'n_estimators': range(80, 100, 20)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [78]:
roc_chart(gbt_grid,'Gradient Boosting')
AUC is: 0.8015847089981658
In [79]:
gbt_fi = pd.DataFrame({'Features':X.columns,'GBT_Importance':gbt_grid.best_estimator_.feature_importances_}).sort_values("GBT_Importance", ascending=False)
In [80]:
feature_compared = pd.concat([dt_cof,rf_fi['RF_Importance'],gbt_fi['GBT_Importance']],axis=1)
feature_compared.sort_values('GBT_Importance',ascending=False).head()
Out[80]:
Features DT_Importance RF_Importance GBT_Importance
3 High Blood Pressure 0.897399 0.558514 0.581389
4 Low Blood Pressure 0.102601 0.158551 0.149307
0 Age 0.000000 0.076020 0.076166
11 Cholesterol__normal 0.000000 0.055171 0.074452
2 Weight 0.000000 0.043042 0.041281
In [81]:
compared_df = pd.DataFrame([dt_grid.best_score_],index=['Score'],columns=['DT'])
compared_df['RF'] = rf_grid.best_score_
compared_df['GBT'] = gbt_grid.best_score_
compared_df
Out[81]:
DT RF GBT
Score 0.716718 0.735062 0.735645

Conclusion

There is a obviously result that the top 3 features importance of all of them are the same, they are High Blood Pressure, Low Blood Pressure,Age. Random Forest have a better performance than Decision Tree because Random Forest would build many trees. As for Gradient Boosting, it uses gradient descent to optimize the model, so Gradient Boosting Tree can get a better performance than Random Forest.

Model Comparision

Linear SVM, Logistic Regression and Single Layer Perceotron

They have a similar performance. For logistic regression and single layer perceptron, they are also very similar in architecture, both two used sigmoid function to classify, the reason why they have different performance is because Logistic regression models a function of the mean of a Bernoulli distribution as a linear equation, but Perceptron doesn't use probabilistic assumptions for neither the model nor its parameter. As for Linear SVM and LR, both two are linear classfier, but SVM bases on distance and LR depends on probability, so SVM doesn't affect by data distribution, but LR does

Combination and Comparison of Multiple Machine Learning Algorithms

In [104]:
def table(name, param, score, cost):
    data = {'Name':name
            ,'Key hyperparameters tuned':param
            ,'Model performance':score
            ,'Estimate of time(n=nums,d=dimension,k=neighbor nums)':cost
            }
    index = data.keys()
    df = pd.Series([name, param, score, cost], index=index)
    return df
In [105]:
lr_table = table('Logistics Regression',lr_grid.best_params_,lr_grid.best_score_,'O(nd)')
ann0_table = table('ANN0',ann0_params,ann0_score[1],'-')
ann1_table = table('ANN1',ann1_params,ann1_score[1],'-')
ann2_table = table('ANN2',ann2_params,ann2_score[1],'-')
dr_table = table('Decision Tree',dt_grid.best_params_,dt_grid.best_score_,'O(n*log(n)*d)')
hw4_table = pd.concat([lr_table,dr_table,ann0_table,ann1_table,ann2_table],axis=1).T
hw4_table 
Out[105]:
Name Key hyperparameters tuned Model performance Estimate of time(n=nums,d=dimension,k=neighbor nums)
0 Logistics Regression {'C': 0.01, 'penalty': 'l2', 'solver': 'saga'} 0.728238 O(nd)
1 Decision Tree {'criterion': 'gini', 'max_depth': 2, 'min_imp... 0.716718 O(n*log(n)*d)
2 ANN0 {'epochs': 20, 'batch_size': 128, 'loss': 'bin... 0.729469 -
3 ANN1 {'epochs': 10, 'batch_size': 128, 'loss': 'bin... 0.736613 -
4 ANN2 {'epochs': 30, 'batch_size': 128, 'loss': 'bin... 0.734163 -
In [106]:
hw3_table = pd.read_csv('./hw3_table.csv')
In [107]:
index = ['Name','Key hyperparameters tuned','Model performance','Estimate of time']
table = pd.concat([hw3_table,hw4_table],axis=0,ignore_index=True,sort=False)
table =table.drop(["Unnamed: 0"],axis=1).reindex()
table.sort_values("Model performance", ascending =False)
Out[107]:
Name Key hyperparameters tuned Model performance Estimate of time(n=nums,d=dimension,k=neighbor nums)
9 ANN1 {'epochs': 10, 'batch_size': 128, 'loss': 'bin... 0.736613 -
10 ANN2 {'epochs': 30, 'batch_size': 128, 'loss': 'bin... 0.734163 -
2 LinearSVM {'C': 0.5} 0.731818 O(n)
3 Non_linearSVM {'C': 0.5} 0.731818 O(n)
8 ANN0 {'epochs': 20, 'batch_size': 128, 'loss': 'bin... 0.729469 -
6 Logistics Regression {'C': 0.01, 'penalty': 'l2', 'solver': 'saga'} 0.728238 O(nd)
1 KNN {'leaf_size': 30, 'n_neighbors': 30} 0.727278 O(knd)
7 Decision Tree {'criterion': 'gini', 'max_depth': 2, 'min_imp... 0.716718 O(n*log(n)*d)
0 Naive Bayes {'priors': (0.1, 0.9), 'var_smoothing': 1e-09} 0.703544 O(n * d)
4 Ramdom Forest {'priors': (0.1, 0.9), 'var_smoothing': 1e-09} 0.703544 O(n*log(n)*d*k)
5 Gradient Boosting {'priors': (0.1, 0.9), 'var_smoothing': 1e-09} 0.703544 O(n*log(n)*d*k)

Model Valiadtion

In [87]:
testing = pd.read_csv('./Disease Prediction Testing.csv')
testing_data = pd.read_csv('./Disease Prediction Testing.csv')
In [88]:
testing = cat2num(testing)
In [89]:
testing = testing.drop(['ID'],axis=1)
In [232]:
scaler = StandardScaler()
scaler.fit(testing)
testing_std = scaler.transform(testing)
dt_pred = best_dt.predict(testing_std)
lr_pred = lr_grid.predict(testing_std)
ANN0_pred = ANN0_model.predict_classes(testing_std)
ANN1_pred = ANN1_model.predict_classes(testing_std)
ANN2_pred = ANN2_model.predict_classes(testing_std)
In [233]:
result = pd.DataFrame({'ID':testing_data.ID
                       ,'Decision Tree':dt_pred
                       ,'Logistic Regression':lr_pred
                       ,'ANN0':ANN0_pred.ravel()
                       ,'ANN1':ANN1_pred.ravel()
                       ,'ANN2':ANN2_pred.ravel()
                      })
In [234]:
result
Out[234]:
ID Decision Tree Logistic Regression ANN0 ANN1 ANN2
0 0 0 0 0 0 0
1 1 0 0 0 0 1
2 2 0 1 1 1 1
3 3 1 1 1 1 1
4 4 0 0 0 1 1
... ... ... ... ... ... ...
20995 20995 1 0 0 0 1
20996 20996 0 0 0 0 1
20997 20997 0 1 1 1 1
20998 20998 1 1 1 1 1
20999 20999 1 1 1 0 1

21000 rows × 6 columns

In [235]:
result.describe()
Out[235]:
ID Decision Tree Logistic Regression ANN0 ANN1 ANN2
count 21000.000000 21000.00000 21000.000000 21000.000000 21000.000000 21000.000000
mean 10499.500000 0.40681 0.537476 0.543286 0.671190 0.743429
std 6062.322162 0.49125 0.498605 0.498135 0.469792 0.436751
min 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
25% 5249.750000 0.00000 0.000000 0.000000 0.000000 0.000000
50% 10499.500000 0.00000 1.000000 1.000000 1.000000 1.000000
75% 15749.250000 1.00000 1.000000 1.000000 1.000000 1.000000
max 20999.000000 1.00000 1.000000 1.000000 1.000000 1.000000
In [236]:
result.to_csv('homework_4_YueyuanHe_results.csv')
In [ ]: